Objective: Given independent variables X and dependent variables y, find a coefficient matrix B and bias (intercept) vector c such that X*B + c ≈ y in the least-squares sense.
Ref: https://towardsdatascience.com/assumptions-of-linear-regression-5d87c347140
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from numpy import array
from numpy.linalg import inv
from numpy.linalg import pinv
from sklearn.datasets import make_regression
(a) Analytical Solution Using Linear Algebra:
# 1. Direct Analytical Solution
X, y = make_regression(n_samples=10, n_features=5, n_targets=1)
#X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
# Normal equation: b = (X^T X)^(-1) X^T y
b = inv(X.T.dot(X)).dot(X.T).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot predictions vs. actuals with a y = x reference line
plt.scatter(yhat, y)
plt.plot(yhat, yhat, color='red')
plt.show()
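As a numerical note, explicitly inverting X^T X is fragile when features are nearly collinear; `np.linalg.solve` or `np.linalg.lstsq` computes the same least-squares coefficients more stably. A minimal sketch (a `random_state` is added here for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10, n_features=5, n_targets=1, random_state=0)

# Solve the normal equations (X^T X) b = X^T y without forming an explicit inverse
b_solve = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# lstsq minimizes ||Xb - y||^2 directly and also handles rank-deficient X
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_solve, b_lstsq))
```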
(b) Solution Using Matrix Decomposition (SVD):
# 2. SVD Decomposition
X, y = make_regression(n_samples=10, n_features=5, n_targets=1)
#X = np.hstack((X, np.ones((X.shape[0], 1), dtype=X.dtype)))
b = pinv(X).dot(y)
print(b)
# predict using coefficients
yhat = X.dot(b)
# plot data and predictions
plt.scatter(yhat, y)
plt.plot(yhat, yhat, color='red')
plt.show()
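For reference, `pinv` is itself computed from the SVD: with X = U Σ V^T, the pseudoinverse is X⁺ = V Σ⁺ U^T, where Σ⁺ inverts the non-zero singular values. A small sketch making that explicit:

```python
import numpy as np
from numpy.linalg import svd, pinv

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))

# Reduced SVD: X = U * diag(s) * Vt
U, s, Vt = svd(X, full_matrices=False)

# Pseudoinverse assembled from the SVD factors: X+ = V * diag(1/s) * U^T
X_pinv_manual = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.allclose(X_pinv_manual, pinv(X)))
```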
from pandas import read_csv
from numpy import arange
from scipy.optimize import curve_fit
# define the true objective function
def objective(x, a, b, c):
    return a * x + b * x**2 + c
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/longley.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# choose the input and output variables
x, y = data[:, 4], data[:, -1]
# curve fit
popt, _ = curve_fit(objective, x, y)
# summarize the parameter values
a, b, c = popt
print('y = %.5f * x + %.5f * x^2 + %.5f' % (a, b, c))
# plot input vs output
plt.scatter(x, y)
# define a sequence of inputs between the smallest and largest known inputs
x_line = arange(min(x), max(x), 1)
# calculate the output for the range
y_line = objective(x_line, a, b, c)
# create a line plot for the mapping function
plt.plot(x_line, y_line, '--', color='red')
plt.show()
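Since this objective is linear in the parameters a, b and c, `curve_fit`'s nonlinear solver lands on the same answer as a plain polynomial least-squares fit. A sketch on synthetic data (generated here for illustration) showing that `np.polyfit` recovers matching coefficients:

```python
import numpy as np
from scipy.optimize import curve_fit

def objective(x, a, b, c):
    return a * x + b * x**2 + c

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = 2.0 * x + 0.5 * x**2 + 1.0 + rng.normal(scale=0.1, size=x.size)

popt, _ = curve_fit(objective, x, y)

# polyfit fits c2*x^2 + c1*x + c0 (highest degree first): same model
c2, c1, c0 = np.polyfit(x, y, 2)

print(np.allclose(popt, [c1, c2, c0]))
```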
import pyforest
from sklearn import linear_model
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
X = np.random.randn(1000,5)
y = np.random.randn(1000,2)
X = pd.DataFrame(X)
X.columns = ['X1','X2','X3','X4','X5']
y = pd.DataFrame(y)
y.columns = ['y1','y2']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = linear_model.LinearRegression(fit_intercept=True)  # the 'normalize' argument was removed in scikit-learn 1.2; scale features beforehand if needed
model.fit(X_train,y_train)
y_train_pred = pd.DataFrame(model.predict(X_train))
y_train_pred.columns = ['y1_pred','y2_pred']
y_test_pred = pd.DataFrame(model.predict(X_test))
y_test_pred.columns = ['y1_pred','y2_pred']
print("R2 Scores = ", r2_score(y_test,y_test_pred,multioutput='raw_values'))
print("X Coefficients = " + str(model.coef_))
print("Bias = " + str(model.intercept_))
plt.scatter(y_test['y1'],y_test_pred['y1_pred'])
plt.plot(y_test['y1'],y_test['y1'], color='blue')
plt.xlabel('y_true',fontsize = 12)
plt.ylabel('y_pred',fontsize = 12)
Since y here was generated independently of X (pure noise), the linear-regression assumptions, in particular a linear relationship between X and y, do not hold, and the fit is poor!
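A minimal sketch of why: when y is pure noise, independent of X, the held-out R² of a linear fit sits near zero (often slightly negative), confirming there is no linear signal to recover:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = rng.standard_normal(1000)  # generated independently of X: no signal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Held-out R^2 is near zero for pure noise
print(model.score(X_test, y_test))
```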
Steps in Machine Learning (Tabular Data):
Step 1: Import the data-set and libraries
Step 2: Identify numerical, categorical and datetime features
Step 3: Perform EDA - summary statistics, correlations, identification of null/missing values, univariate/bivariate analysis (chi2 analysis, point biserial correlation)
Step 4: Remove irrelevant/highly correlated features
Step 5: Train-Test split
Step 6: Missing value imputation, categorical feature encoding, scaling on train (fit_transform method) and apply the same to test (transform method)
Step 7: Oversampling/Undersampling and dimensionality reduction (if required)
Step 8: Outlier detection (and removal) on X_train, y_train
Step 9: Feature Engineering (Box-Cox, Yeo-Johnson transformation, LASSO, SISSO etc.)
Step 10: Set up models and evaluation metrics in an n-fold CV environment
Step 11: Fit the models and do hyperparameter tuning
Step 12: Use the best performing model to make predictions on unseen test dataset
Step 13: Add regularization/do further hyperparameter tuning to mitigate overfitting (if required)
Step 14: Use the best model with its optimal hyperparameters to fit on the entire dataset (train + test)
Step 15: During inference time, subject the inference data to preprocessing steps (Step 4, Step 6 and Step 9) and make live predictions using above model
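Steps 5 and 6 above can be sketched with scikit-learn's Pipeline/ColumnTransformer, which guarantees that imputation, scaling and encoding are fit on the training split only (fit_transform) and then applied unchanged to the test split (transform). The column names ('num1', 'num2', 'cat1') and the tiny DataFrame are illustrative placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data with missing values; column names are illustrative placeholders
df = pd.DataFrame({
    'num1': [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    'num2': [10.0, np.nan, 30.0, 40.0, 50.0, 60.0],
    'cat1': ['a', 'b', 'a', np.nan, 'b', 'a'],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

numeric = Pipeline([('impute', SimpleImputer(strategy='mean')),
                    ('scale', MinMaxScaler())])
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('encode', OneHotEncoder(handle_unknown='ignore'))])
prep = ColumnTransformer([('num', numeric, ['num1', 'num2']),
                          ('cat', categorical, ['cat1'])])

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.33, random_state=0)
X_train_t = prep.fit_transform(X_train)  # fit on train only
X_test_t = prep.transform(X_test)        # reuse the fitted transforms on test
print(X_train_t.shape, X_test_t.shape)
```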
Popular Non-linear Machine Learning Models:
(a) Decision Trees and Random Forest (Bagging)
https://towardsdatascience.com/decision-trees-explained-3ec41632ceb6
Main Hyperparameters: n_estimators, max_depth, criterion, max_features, class_weight
(b) AdaBoost and XGBoost (Boosting)
https://towardsdatascience.com/xgboost-mathematics-explained-58262530904a
https://towardsdatascience.com/boosting-algorithm-adaboost-b6737a9ee60c
https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db
Main Hyperparameters: base_estimator, n_estimators, learning_rate, class_weight
(c) Support Vector Machines
https://towardsdatascience.com/support-vector-machine-simply-explained-fee28eba5496
Main Hyperparameters: C, gamma
(d) K-Nearest Neighbors
Main Hyperparameters: n_neighbors, weights, algorithm
(e) Kernel Ridge Regression
https://www.ics.uci.edu/~welling/classnotes/papers_class/Kernel-Ridge.pdf
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html
Main Hyperparameters: alpha, kernel, gamma
(f) Multi Layer Perceptron (ANN)
https://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf
https://www.analyticsvidhya.com/blog/2020/12/mlp-multilayer-perceptron-simple-overview/
Main Hyperparameters: hidden_layer_sizes, activation
(g) Gaussian Naive Bayes
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
Main Hyperparameters: None
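As a quick way to compare several of the models above on the same data, `cross_val_score` runs each in an n-fold CV loop (Step 10). A sketch on synthetic data; the hyperparameter values are arbitrary starting points, not tuned:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic regression problem; hyperparameters are untuned starting points
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=0),
    'SVR': SVR(C=1.0, gamma='scale'),
    'KNN': KNeighborsRegressor(n_neighbors=3, weights='distance'),
    'KernelRidge': KernelRidge(alpha=1.0, kernel='rbf'),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"{name}: mean R2 = {scores.mean():.3f}")
```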
import dtale
data = pd.read_csv(r'C:\Users\suryanaman.c\Desktop\restaurant-revenue-prediction\train.csv')
d = dtale.show(data)
#d.open_browser()
import pyforest
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN
import warnings
warnings.filterwarnings('ignore')
#config parameters
path = r'C:\Users\suryanaman.c\Desktop\restaurant-revenue-prediction\train.csv'
categorical_features = ['City', 'City Group', 'Type']
numerical_features = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8', 'P9', 'P10', 'P11',
'P12', 'P13', 'P14', 'P15', 'P16', 'P17', 'P18', 'P19', 'P20', 'P21',
'P22', 'P23', 'P24', 'P25', 'P26', 'P27', 'P28', 'P29', 'P30', 'P31',
'P32', 'P33', 'P34', 'P35', 'P36', 'P37']
date_feature = ['Open Date']
#function definitions
def data_fetch(path):
    file = pd.read_csv(path)
    return file
def data_prep(data):
    # Drop columns with more than 10% missing values
    thresh = len(data) * 0.9
    data = data.dropna(thresh=thresh, axis=1)
    # Read about correlations between features (Pearson for numerical-numerical,
    # Point Biserial for numerical-binary and Chi2 for categorical-categorical)
    # Link: https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
    # Link: https://towardsdatascience.com/point-biserial-correlation-with-python-f7cd591bd3b1

    def imputation_encoding_and_train_test_split(data):
        X = data[numerical_features + categorical_features + date_feature]
        X[date_feature] = X[date_feature].astype('datetime64[ns]')
        # Convert the opening date into the number of days elapsed
        X['delta_time'] = (pd.Timestamp.now() - X[date_feature[0]]).dt.days
        X = X.drop(columns=date_feature)
        date_feature_new = ['delta_time']
        y = data['revenue']
        # Train-test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        #X_train[numerical_features] = X_train[numerical_features].apply(pd.to_numeric, errors='coerce')
        imputer_num = SimpleImputer(missing_values=np.nan, strategy='mean')
        scaler = QuantileTransformer(n_quantiles=10)
        # Mean imputation and quantile scaling for numerical features:
        # fit on train (fit_transform), then apply the same transform to test (transform)
        X_train[numerical_features + date_feature_new] = imputer_num.fit_transform(X_train[numerical_features + date_feature_new])
        X_train[numerical_features + date_feature_new] = scaler.fit_transform(X_train[numerical_features + date_feature_new])
        X_test[numerical_features + date_feature_new] = imputer_num.transform(X_test[numerical_features + date_feature_new])
        X_test[numerical_features + date_feature_new] = scaler.transform(X_test[numerical_features + date_feature_new])
        # Mode imputation and dummy encoding for categorical features
        for col in categorical_features:
            mode_category = X_train[col].mode()[0]
            X_train[col] = X_train[col].fillna(mode_category)
            X_test[col] = X_test[col].fillna(mode_category)
        dummies_train = pd.get_dummies(X_train[categorical_features])
        X_train = pd.concat([X_train[numerical_features + date_feature_new], dummies_train], axis=1)
        dummies_test = pd.get_dummies(X_test[categorical_features])
        X_test = pd.concat([X_test[numerical_features + date_feature_new], dummies_test], axis=1)
        # Add any dummy columns missing from the test set with default value 0
        missing_cols = set(X_train.columns) - set(X_test.columns)
        for c in missing_cols:
            X_test[c] = 0
        # Ensure the test-set columns are in the same order as in the train set
        X_test = X_test[X_train.columns]
        return X_train, y_train, X_test, y_test

    def outlier_removal(X_train, y_train):
        # Link: https://towardsdatascience.com/outlier-detection-with-isolation-forest-3d190448d45e
        from sklearn.ensemble import IsolationForest
        iso = IsolationForest(contamination=0.01)
        X_tr_np = X_train.to_numpy()
        y_tr_np = y_train.to_numpy()
        yhat = iso.fit_predict(X_tr_np)
        # Keep all rows that are not flagged as outliers
        mask = yhat != -1
        X_tr_np, y_tr_np = X_tr_np[mask, :], y_tr_np[mask]
        X_train_new = pd.DataFrame(X_tr_np, columns=X_train.columns)
        y_train_new = y_tr_np
        return X_train_new, y_train_new

    def feature_engineering(X_train, y_train, X_test, y_test):
        # Link: https://arxiv.org/pdf/1901.07329.pdf
        from autofeat import AutoFeatRegressor
        afreg = AutoFeatRegressor(verbose=1, feateng_steps=2, featsel_runs=1)
        X_train_af = afreg.fit_transform(X_train, y_train)
        X_test_af = afreg.transform(X_test)
        new = set(X_train_af.columns) - set(X_train.columns)
        print("New features after feature engineering: " + str(new))
        return X_train_af, X_test_af

    X_train, y_train, X_test, y_test = imputation_encoding_and_train_test_split(data)
    #X_train, y_train = outlier_removal(X_train, y_train)
    #X_train, X_test = feature_engineering(X_train, y_train, X_test, y_test)
    return X_train, y_train, X_test, y_test
def model_train(X_train, y_train, X_test, y_test):
    # Linear models
    reg1 = GridSearchCV(LinearRegression(), {'fit_intercept': [True, False]}, cv=5, scoring='r2')
    reg2 = GridSearchCV(Ridge(), {'alpha': [0.1], 'fit_intercept': [True, False]}, cv=5, scoring='r2')
    reg3 = GridSearchCV(Lasso(), {'alpha': [0.1], 'fit_intercept': [True, False]}, cv=5, scoring='r2')
    reg4 = GridSearchCV(ElasticNet(), {'alpha': [0.01], 'l1_ratio': [0.5], 'fit_intercept': [True, False]}, cv=5, scoring='r2')
    reg5 = GridSearchCV(SGDRegressor(), {'alpha': [0.0001, 0.01, 0.1], 'max_iter': [10000]}, cv=5, scoring='r2')
    # Tree and ensemble models
    reg6 = GridSearchCV(DecisionTreeRegressor(), {'max_depth': [25], 'random_state': [0]}, cv=5, scoring='r2')
    reg7 = GridSearchCV(RandomForestRegressor(), {'max_depth': [20], 'n_estimators': [100], 'bootstrap': [True], 'min_samples_leaf': [1], 'max_features': [1.0], 'criterion': ['squared_error'], 'random_state': [0]}, cv=5, scoring='r2')  # 'auto'/'mse' were removed in newer scikit-learn
    reg8 = GridSearchCV(AdaBoostRegressor(), {'estimator': [RandomForestRegressor()], 'n_estimators': [100], 'learning_rate': [1], 'loss': ['exponential'], 'random_state': [0]}, cv=5, scoring='r2')  # 'base_estimator' was renamed to 'estimator' in scikit-learn 1.2
    reg9 = GridSearchCV(XGBRegressor(), {'max_depth': [20], 'n_estimators': [100], 'random_state': [0]}, cv=5, scoring='r2')
    # Kernel-based models
    reg10 = GridSearchCV(SVR(), {'kernel': ['rbf'], 'C': [1], 'gamma': [0.001]}, cv=5, scoring='r2')
    reg11 = GridSearchCV(KernelRidge(), {'kernel': ['laplacian'], 'alpha': [0.1], 'gamma': [0.05]}, cv=5, scoring='r2')
    # Other models
    reg12 = GridSearchCV(KNeighborsRegressor(), {'n_neighbors': [3], 'weights': ['distance'], 'algorithm': ['auto'], 'leaf_size': [1, 2, 3], 'p': [2]}, cv=5, scoring='r2')
    reg13 = GridSearchCV(MLPRegressor(), {'hidden_layer_sizes': [(5,)], 'max_iter': [10000]}, cv=5, scoring='r2')
    #----------------------------------------------------------------------------------------------------------------
    # Choose a regressor, fit it and evaluate
    reg = reg7
    reg.fit(X_train, y_train)
    y_train_pred = reg.predict(X_train)
    y_test_pred = reg.predict(X_test)
    # MAPE (in percent) and RMSE on the train and test splits
    score_train = np.mean(np.abs((y_train - y_train_pred) / y_train)) * 100
    score_test = np.mean(np.abs((y_test - y_test_pred) / y_test)) * 100
    score_train_rmse = np.sqrt(((y_train_pred - y_train) ** 2).mean())
    score_test_rmse = np.sqrt(((y_test_pred - y_test) ** 2).mean())
    print("Regression Model: \n")
    print("MAPE on the training dataset is: " + str(score_train))
    print("MAPE on the test dataset is: " + str(score_test))
    print("\nRMSE on the training dataset is: " + str(score_train_rmse))
    print("RMSE on the test dataset is: " + str(score_test_rmse))
    return score_train, score_test, y_train, y_train_pred, y_test, y_test_pred
def visualize_results(y_train, y_train_pred, y_test, y_test_pred):
    fig = plt.figure()
    fig.set_figheight(10)
    fig.set_figwidth(6)
    # Train split: predicted vs. actual with a y = x reference line
    ax1 = fig.add_subplot(211)
    ax1.scatter(y_train, y_train_pred)
    ax1.plot(y_train, y_train, color='red')
    ax1.set_xlim([0, 10000000])
    ax1.set_ylim([0, 10000000])
    ax1.set_xlabel('Actual Revenue', fontsize=12)
    ax1.set_ylabel('Predicted Revenue', fontsize=12)
    #ax1.set_aspect('equal')
    # Test split: predicted vs. actual
    ax2 = fig.add_subplot(212)
    ax2.scatter(y_test, y_test_pred)
    ax2.plot(y_test, y_test, color='red')
    ax2.set_xlim([0, 10000000])
    ax2.set_ylim([0, 10000000])
    ax2.set_xlabel('Actual Revenue', fontsize=12)
    ax2.set_ylabel('Predicted Revenue', fontsize=12)
    #ax2.set_aspect('equal')
    plt.show()
# Combine all the operations and display
if __name__ == '__main__':
    data = data_fetch(path)
    X_train, y_train, X_test, y_test = data_prep(data)
    score_train, score_test, y_train, y_train_pred, y_test, y_test_pred = model_train(X_train, y_train, X_test, y_test)
    visualize_results(y_train, y_train_pred, y_test, y_test_pred)
import pyforest
import dtale
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, recall_score, precision_score, confusion_matrix, plot_confusion_matrix, plot_precision_recall_curve
from sklearn.svm import SVC
from imblearn.over_sampling import ADASYN
import warnings
warnings.filterwarnings('ignore')
#config parameters
path = r'C:\Users\suryanaman.c\Desktop\train.csv'
numerical_features = ['no_of_trainings', 'age', 'length_of_service', 'awards_won?', 'avg_training_score']
categorical_features = ['department', 'region', 'education', 'gender', 'recruitment_channel', 'previous_year_rating', 'KPIs_met >80%']
def data_fetch(path):
    file = pd.read_csv(path)
    return file
def data_prep(data):
    X = data[numerical_features + categorical_features]
    y = data['is_promoted']
    print("Original class distribution is: " + str(y.value_counts()[0]) + ":" + str(y.value_counts()[1]))
    # Stratified train-test split to preserve the class ratio
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    # Mean imputation and MinMax scaling for numerical features: fit on train, apply to test
    imputer_num = SimpleImputer(missing_values=np.nan, strategy='mean')
    scaler = MinMaxScaler()
    X_train[numerical_features] = imputer_num.fit_transform(X_train[numerical_features])
    X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
    # Mode imputation and dummy encoding for categorical features
    for col in categorical_features:
        mode_category = X_train[col].mode()[0]
        X_train[col] = X_train[col].fillna(mode_category)
        X_test[col] = X_test[col].fillna(mode_category)
    dummies_train = pd.get_dummies(X_train[categorical_features])
    X_train = pd.concat([X_train[numerical_features], dummies_train], axis=1)
    X_test[numerical_features] = imputer_num.transform(X_test[numerical_features])
    X_test[numerical_features] = scaler.transform(X_test[numerical_features])
    dummies_test = pd.get_dummies(X_test[categorical_features])
    X_test = pd.concat([X_test[numerical_features], dummies_test], axis=1)
    # Add any dummy columns missing from the test set with default value 0
    missing_cols = set(X_train.columns) - set(X_test.columns)
    for c in missing_cols:
        X_test[c] = 0
    # Ensure the test-set columns are in the same order as in the train set
    X_test = X_test[X_train.columns]

    def minority_oversampling(X_train, y_train):
        # Minority oversampling using ADASYN (try SMOTE vs ADASYN)
        oversampling = ADASYN(sampling_strategy=0.9, random_state=0, n_neighbors=10)
        X_train_oversample, y_train_oversample = oversampling.fit_resample(X_train, y_train)
        return X_train_oversample, y_train_oversample

    X_train, y_train = minority_oversampling(X_train, y_train)
    print("Class distribution after oversampling is: " + str(y_train.value_counts()[0]) + ":" + str(y_train.value_counts()[1]))
    return X_train, y_train, X_test, y_test
def model_train(X_train, y_train, X_test, y_test):
    clf1 = GridSearchCV(RandomForestClassifier(random_state=0, class_weight='balanced'), {'max_depth': [20], 'n_estimators': [50], 'criterion': ['gini'], 'random_state': [0]}, cv=5)
    clf2 = GridSearchCV(SVC(class_weight='balanced'), {'kernel': ['rbf'], 'C': [10], 'gamma': [0.05]}, cv=5)
    clf1.fit(X_train, y_train)
    score_train = f1_score(y_train, clf1.predict(X_train))
    score_test = f1_score(y_test, clf1.predict(X_test))
    print("Classification Model: \n")
    print("F1-Score on the training dataset is: " + str(score_train))
    print("\nF1-Score on the test dataset is: " + str(score_test))
    print(confusion_matrix(y_test, clf1.predict(X_test)))
    plot_precision_recall_curve(clf1, X_test, y_test)
    return score_test
if __name__ == '__main__':
    data = data_fetch(path)
    X_train, y_train, X_test, y_test = data_prep(data)
    score_test = model_train(X_train, y_train, X_test, y_test)